1) Understand the domain expertise, ask questions
2) Acquire the data
3) Take a first glance at the data (can sometimes be skipped)
4) Sample if needed
5) Given a DataFrame that fits in memory, investigate further
6) Investigate missing values and outliers
7) Heavy-lifting exploration
1) Understand the domain expertise, ask questions
2) Acquire the data
You can acquire the data from the following resources:
3) Take a first glance at the data (can sometimes be skipped)
Once you know the business process and you have the data on your end, you should play around with it.
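A minimal sketch of such a first glance, using a tiny invented DataFrame in place of real data (column names are illustrative):

```python
import pandas as pd

# Toy data standing in for a freshly acquired dataset.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "amount": [10.0, 25.5, None, 40.0, 12.5],
    "country": ["US", "IL", "US", "DE", "IL"],
})

print(df.head(3))   # first few rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # column types
```

Even these three calls often reveal surprises: unexpected types, missing values, or columns you did not know existed.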
4) Sample if needed
If the data doesn't fit in RAM, we use one of the following:
- Naively splitting the file into smaller files
- Randomly sampling the file
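Both options above can be sketched with pandas; here an in-memory CSV stands in for a file too large to load at once (the column names and the 10% sampling rate are illustrative):

```python
import io
import random
import pandas as pd

# Generate a "large" CSV in memory as a stand-in for a big file on disk.
big_csv = io.StringIO("\n".join(["x,y"] + [f"{i},{i * 2}" for i in range(1000)]))

# Option 1: process the file in fixed-size chunks instead of loading it whole.
chunk_sums = [chunk["y"].sum() for chunk in pd.read_csv(big_csv, chunksize=100)]
total = sum(chunk_sums)

# Option 2: keep each data row with probability 0.1 to get a random sample
# without ever materializing the full file as a DataFrame.
big_csv.seek(0)
random.seed(0)
sample = pd.read_csv(big_csv, skiprows=lambda i: i > 0 and random.random() > 0.1)
```

The chunked pass computes an exact aggregate; the sampled read trades exactness for a DataFrame small enough to explore interactively.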
5) Given a DataFrame that fits in memory, we can investigate further
- Data metadata: number of records, number of bytes
- Warnings regarding the data, e.g. high correlation, missing values, etc.
- Identify the schema of the data, i.e. the data type and category of each variable
- Look at means, medians, standard deviations, and histograms to understand the distributions
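The four bullets above map almost one-to-one onto standard pandas calls; a sketch on invented data (column names and value ranges are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "income": rng.normal(50_000, 10_000, size=200),
    "segment": rng.choice(["a", "b", "c"], size=200),
})

df.info(memory_usage="deep")        # records, dtypes, memory footprint
print(df.describe())                # mean, std, quartiles for numeric columns
print(df["segment"].value_counts()) # category frequencies
# df.hist() would draw histograms of the numeric columns.
```

`describe()` and `value_counts()` together give a quick read on both the numeric distributions and the categorical balance.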
- Check completeness: are critical data values missing? A database with missing values is not unusual, but when the missing information is critical, completeness becomes an issue.
- Check conformity: does the data follow standard data definitions?
For example, are dates in a standard format? Maintaining conformity to standard formats is important for consistent structure and nomenclature, both for sharing and for internal data management.
- Check accuracy: are your data values correct?
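The completeness and conformity checks are easy to automate; a sketch on toy records with deliberate quality problems (table and column names are invented, and ISO dates are assumed to be the agreed standard):

```python
import pandas as pd

# Toy records with deliberate quality problems.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [99.0, None, 45.5, 12.0],          # completeness: a missing value
    "order_date": ["2021-01-05", "2021/02/03",   # conformity: mixed formats
                   "2021-03-11", "03-04-2021"],
})

# Completeness: how many critical values are missing per column?
missing = df.isna().sum()

# Conformity: which dates fail the agreed ISO format?
iso = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
bad_dates = df.loc[iso.isna(), "order_date"]
```

`errors="coerce"` turns non-conforming dates into `NaT`, so the rows that violate the standard fall out directly.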
- note: still missing box plots for continuous variables
- note: consider practical significance; a small effect can sometimes be useful and a large one useless
- Bi-variate analysis finds the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level.
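A correlation matrix is one common way to screen pairs of continuous variables for association; a minimal sketch on synthetic data (variable names invented, and a formal test such as `scipy.stats.pearsonr` would be needed for the actual significance level):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)  # strongly associated with x
z = rng.normal(size=500)                     # independent of x

df = pd.DataFrame({"x": x, "y": y, "z": z})
corr = df.corr()  # Pearson correlation between each pair of variables
```

The constructed pair (x, y) shows up with a correlation near 1, while (x, z) stays near 0, which is the association/disassociation contrast the text describes.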
- Small analysis
The following issues need to be taken into consideration in this part as well:
- Check timeliness: is the data available when expected and needed? Timeliness depends on the user's expectations and needs. Relevant only for sources where acquiring the data is very fast.
- Check consistency: does the data across several systems reflect the same information? If data is reported across multiple systems, it should carry the same information.
- Check integrity: is the data valid across relationships, and can all the data in a database be traced and connected? For example, in a customer database there should be a valid customer/sales relationship.
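The integrity check maps to a referential-integrity test; a sketch using the customer/sales example from the text (table and column names are invented):

```python
import pandas as pd

# Hypothetical customer and sales tables.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
sales = pd.DataFrame({
    "sale_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 99],  # 99 has no matching customer row
})

# Integrity: every sale must reference an existing customer.
orphans = sales[~sales["customer_id"].isin(customers["customer_id"])]
```

Any rows left in `orphans` are sales that cannot be traced back to a customer, i.e. a broken customer/sales relationship.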
6) Investigate outliers and missing values
Sometimes the missing values hold a pattern; to detect it, we should use the following:
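One simple way to look for such a pattern is to flag missingness as its own variable and compare it against the other columns; a sketch on synthetic survey-style data (the "older respondents skip the income question" mechanism is invented purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = rng.integers(18, 80, size=300)
income = rng.normal(50_000, 10_000, size=300)
income[age > 60] = np.nan  # illustrative pattern: older respondents skip the question

df = pd.DataFrame({"age": age, "income": income})

# Turn missingness into a boolean column and relate it to other variables.
df["income_missing"] = df["income"].isna()
rate_by_group = df.groupby(df["age"] > 60)["income_missing"].mean()
```

If missingness were random, the rate would be roughly equal across groups; a large gap, as here, is the kind of pattern worth investigating before imputing or dropping rows.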
Sometimes the outliers are actually the most interesting part of the data, which makes them a very important part of the analysis. One can find outliers (univariate and multivariate) using the following:
After the outliers have been found, one should explore the data as in #5.
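As one concrete univariate technique, the IQR rule (Tukey's fences) flags points far outside the interquartile range; a sketch on synthetic data with two injected outliers (the values and the standard 1.5×IQR multiplier are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# 200 well-behaved values plus two injected extremes.
values = np.concatenate([rng.normal(100, 10, size=200), [250.0, -40.0]])
s = pd.Series(values)

# Univariate outliers via the IQR rule.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```

Multivariate outliers need distance-based methods (e.g. Mahalanobis distance) instead, since a point can be unremarkable in every single column yet anomalous as a combination.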
7) Now we can do some heavy-lifting exploration:
By now you should understand the data and be able to actually answer questions about it.
Things to check in the future: